Softmax regression

Also called multinomial logistic regression, softmax regression is a generalization of logistic regression to the case where we want to handle multiple classes.

Specification

Hypothesis:
(design matrix $X$: $m \times d$; $\theta = [\theta^{(1)} \,\cdots\, \theta^{(K)}]$: $d \times K$; below, $x$ denotes a single $d$-dimensional example)

$$
h_\theta(x) =
\begin{bmatrix}
P(y=1 \mid x; \theta) \\
P(y=2 \mid x; \theta) \\
\vdots \\
P(y=K \mid x; \theta)
\end{bmatrix}
=
\frac{1}{\sum_{j=1}^{K} \exp(\theta^{(j)\top} x)}
\begin{bmatrix}
\exp(\theta^{(1)\top} x) \\
\exp(\theta^{(2)\top} x) \\
\vdots \\
\exp(\theta^{(K)\top} x)
\end{bmatrix}
$$
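
A minimal NumPy sketch of the hypothesis for a single example (the function name `softmax_hypothesis` is mine; subtracting the max is a standard numerical-stability trick, not part of the formula above):

```python
import numpy as np

def softmax_hypothesis(x, theta):
    """Class probabilities h_theta(x) for a single example.

    x     : (d,)   feature vector
    theta : (d, K) one parameter column theta^{(j)} per class
    """
    logits = theta.T @ x                  # theta^{(j)T} x for j = 1..K
    logits -= logits.max()                # shift for stability; ratios are unchanged
    exp_logits = np.exp(logits)
    return exp_logits / exp_logits.sum()  # probabilities summing to 1
```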

Cost function (cross entropy):
$$
J(\theta) = -\sum_{i=1}^{m} \sum_{k=1}^{K} 1\{y^{(i)} = k\} \log \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)\top} x^{(i)})}
$$
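
A corresponding cost computation might look like the sketch below (labels are assumed to be coded 0..K−1 rather than 1..K; the double sum collapses because the indicator selects only the true class for each example):

```python
import numpy as np

def cost(theta, X, y):
    """Cross-entropy cost J(theta).

    theta : (d, K) parameters
    X     : (m, d) design matrix
    y     : (m,)   integer labels coded 0..K-1
    """
    logits = X @ theta                               # theta^{(j)T} x^{(i)} for all i, j
    logits -= logits.max(axis=1, keepdims=True)      # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # the indicator 1{y^{(i)} = k} picks out the log-probability of the true class
    return -log_probs[np.arange(X.shape[0]), y].sum()
```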

Optimize with gradient descent:
$$
\nabla_{\theta^{(k)}} J(\theta) = -\sum_{i=1}^{m} \left[ x^{(i)} \left( 1\{y^{(i)} = k\} - P(y^{(i)} = k \mid x^{(i)}; \theta) \right) \right]
$$
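
A matching gradient sketch, stacking the per-class gradients into a $d \times K$ matrix (function name and the learning rate `alpha` are illustrative, not from the source):

```python
import numpy as np

def gradient(theta, X, y):
    """Gradient of J(theta) with respect to theta, shape (d, K)."""
    m = X.shape[0]
    logits = X @ theta
    logits -= logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)        # P(y^{(i)} = k | x^{(i)}; theta)
    onehot = np.zeros_like(probs)
    onehot[np.arange(m), y] = 1.0                    # 1{y^{(i)} = k}
    return -X.T @ (onehot - probs)                   # column k is the formula above

# a single gradient-descent update:
#   theta -= alpha * gradient(theta, X, y)
```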

Redundancy of Parameters

The actual number of parameters is only $(K-1) \cdot d$ rather than $K \cdot d$, because the probabilities always sum to 1. In fact, subtracting the same vector $\psi$ from every $\theta^{(j)}$ does not affect the hypothesis' predictions at all:

$$
\begin{aligned}
P(y^{(i)} = k \mid x^{(i)}; \theta)
&= \frac{\exp\left((\theta^{(k)} - \psi)^\top x^{(i)}\right)}{\sum_{j=1}^{K} \exp\left((\theta^{(j)} - \psi)^\top x^{(i)}\right)} \\
&= \frac{\exp(\theta^{(k)\top} x^{(i)}) \exp(-\psi^\top x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)\top} x^{(i)}) \exp(-\psi^\top x^{(i)})} \\
&= \frac{\exp(\theta^{(k)\top} x^{(i)})}{\sum_{j=1}^{K} \exp(\theta^{(j)\top} x^{(i)})}.
\end{aligned}
$$

So one can instead set $\theta^{(K)} = \vec{0}$ and optimize only with respect to the remaining parameters.
Note that J(θ) is still convex, but the Hessian is singular, which causes a straightforward implementation of Newton’s method to run into numerical problems.
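
A quick numerical check of this shift invariance (the dimensions and the vector $\psi$ are arbitrary; the helper `probs` mirrors the hypothesis sketch given earlier):

```python
import numpy as np

rng = np.random.default_rng(0)
d, K = 5, 3
x = rng.normal(size=d)
theta = rng.normal(size=(d, K))
psi = rng.normal(size=d)

def probs(theta):
    e = np.exp(theta.T @ x)
    return e / e.sum()

# subtracting psi from every column of theta leaves the predictions unchanged
print(np.allclose(probs(theta), probs(theta - psi[:, None])))   # True
```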

Reference

http://ufldl.stanford.edu/tutorial/supervised/SoftmaxRegression/